A MODEL FOR THE ANALYSIS OF MARKOVIAN DECISION PROCESSES
WITH UNOBSERVABLE STATES AND UNOBSERVABLE COSTS

James D. Steele*

The  Rand  Corporation,  Santa  Monica,  California 


INTRODUCTION 

Consider the Markovian Decision Process (MDP) defined by the
following objects:

State Space $S = \{1, 2, 3, \ldots, N\}$, for finite $N$,

Action Space $A = \{a_1, a_2, \ldots, a_M\}$, for finite $M$,

Cost Set $C = \{C(i, a_K) : i \in S,\ a_K \in A\}$,

Transition Probabilities $\{q_{ij}(a_K) : i, j \in S,\ a_K \in A\}$,

Discount Factor $\alpha$, such that $0 < \alpha < 1$.


The  problem  is  to  find  a policy  for  taking  actions  which  minimizes  the 
total  expected  discounted  cost  over  the  infinite  future,  given  the 
initial  state  of  the  process. 

A stationary policy for (MDP) is defined as a map $f : S \to A$.

Howard  [2]  analyzed  (MDP's)  having  finite  state  and  action  spaces  and 
proved  that  an  optimal  stationary  policy  (i.e.  a stationary  policy 
which  minimizes  the  total  expected  discounted  cost)  always  exists. 

The  Howard  Policy  Improvement  Routine  is  a method  by  which  an  optimal 
stationary  policy  for  (MDP)  may  be  found. 

Suppose now that we are given the (MDP) as defined above, but
that we are not allowed to observe the state at any observation point
$t = 0, 1, 2, \ldots$. Suppose also that we are not allowed to observe
the cost $C(X_t, a_t)$ at any observation point $t = 0, 1, 2, \ldots$. In
other words, the total cost will be assessed at infinity. Finally,
suppose that we are allowed to observe the initial probability
distribution over the state space $S$. In this paper we develop a model for
analyzing this problem and present some preliminary results. A rather
thorough treatment of the problem of unobservable states and
unobservable costs for (MDP's) having two states and two actions may be found
in Steele [3].

THE  MODEL 

In an effort to analyze the above problem, we define the following
objects.

$\tilde{S}$ = {all probability distributions over $S$}

$= \{P = (P_1, P_2, \ldots, P_N) \in E^N : 0 \le P_i \le 1,\ \sum_{i=1}^{N} P_i = 1,\ i = 1, 2, \ldots, N\}$,

where $E^N$ is N-dimensional Euclidean space and we let $P_i$ be the
probability of being in state $i$;

the set $\tilde{A} = A = \{a_1, a_2, \ldots, a_M\}$; the transition
matrices $Q(a_K) = [q_{ij}(a_K)]_{N \times N}$; and the cost vectors
$C(a_K) = (C[1, a_K], \ldots, C[N, a_K])$.

We note that if the distribution over $S$ is $P \in \tilde{S}$ and we take action
$a_K \in \tilde{A}$, then the new distribution over $S$ will be given by $P' = PQ(a_K)$.
The expected cost, $\tilde{C}(P, a_K)$, of having the distribution $P$ and taking
action $a_K$, will be given as the inner product

$\tilde{C}(P, a_K) = \langle P, C(a_K) \rangle = \sum_{i=1}^{N} P_i\, C(i, a_K)$.
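
The distribution update and the expected cost can be sketched numerically. The block below is a minimal illustration and not part of the original paper: the two-state, two-action transition matrices and cost vectors are made-up values, and `step` and `expected_cost` are hypothetical helper names.

```python
def step(P, Q_a):
    """Return P' = P Q(a): a row vector times an N x N transition matrix."""
    n = len(P)
    return [sum(P[i] * Q_a[i][j] for i in range(n)) for j in range(n)]


def expected_cost(P, C_a):
    """Return the inner product <P, C(a)>."""
    return sum(p * c for p, c in zip(P, C_a))


# Assumed example data (not from the paper): N = 2 states, M = 2 actions.
Q = {"a1": [[0.9, 0.1], [0.2, 0.8]], "a2": [[0.5, 0.5], [0.5, 0.5]]}
C = {"a1": [1.0, 3.0], "a2": [2.0, 2.0]}

P = [0.25, 0.75]                   # current distribution over S
P_next = step(P, Q["a1"])          # new distribution after taking a1
cost = expected_cost(P, C["a1"])   # expected one-period cost of a1 at P
```

Because each row of a transition matrix sums to one, `P_next` is again a probability distribution, so the update never leaves the set of distributions over $S$.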
At this point, we note that the new distribution $P'$ depends only on
the current distribution $P$ and the current action $a_K$, i.e. $P' = PQ(a_K)$;
therefore, we see that we now have a new Markovian Decision Process
($\widetilde{\mathrm{MDP}}$) defined by the objects

State Space $\tilde{S}$ = {all probability distributions over $S$},

Action Space $\tilde{A} = A = \{a_1, a_2, \ldots, a_M\}$, and

Cost Set $\tilde{C} = \{\tilde{C}(P, a_K) : P \in \tilde{S},\ a_K \in \tilde{A}\}$,

Discount factor $\alpha$, such that $0 < \alpha < 1$.

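The set of all stationary policies for ($\widetilde{\mathrm{MDP}}$) is given by
$F = \{f : \tilde{S} \to \tilde{A} = A\}$. For any such $f \in F$ and initial state $P_0 \in \tilde{S}$, the
total expected discounted cost is given by

$V_f(P_0) = \sum_{t=0}^{\infty} \alpha^t\, \tilde{C}(P_t, f(P_t)) = \sum_{t=0}^{\infty} \alpha^t \langle P_t, C(f(P_t)) \rangle$, where

$P_t = P_0\, Q(f[P_0])\, Q(f[P_1]) \cdots Q(f[P_{t-1}])$ for $t = 1, 2, 3, \ldots$.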
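
As a rough illustration of the total expected discounted cost of a stationary policy, the sketch below evaluates the series truncated after a fixed number of terms. It is not taken from the paper: the example matrices, the two policies, and the names `step` and `truncated_policy_cost` are assumptions made for the illustration.

```python
def step(P, Q_a):
    """Return P' = P Q(a)."""
    n = len(P)
    return [sum(P[i] * Q_a[i][j] for i in range(n)) for j in range(n)]


def truncated_policy_cost(f, P0, Q, C, alpha, horizon):
    """Sum of alpha^t <P_t, C(f(P_t))> for t = 0, ..., horizon - 1."""
    total, P = 0.0, list(P0)
    for t in range(horizon):
        a = f(P)                                   # action chosen at P_t
        total += alpha ** t * sum(p * c for p, c in zip(P, C[a]))
        P = step(P, Q[a])                          # P_{t+1} = P_t Q(f[P_t])
    return total


# Assumed example data (not from the paper).
Q = {"a1": [[0.9, 0.1], [0.2, 0.8]], "a2": [[0.5, 0.5], [0.5, 0.5]]}
C = {"a1": [1.0, 3.0], "a2": [2.0, 2.0]}


def threshold_policy(P):
    """Hypothetical stationary policy: a1 when state 1 is likely enough."""
    return "a1" if P[0] >= 0.5 else "a2"


def always_a2(P):
    """Hypothetical stationary policy that ignores the distribution."""
    return "a2"


v_threshold = truncated_policy_cost(threshold_policy, [0.25, 0.75], Q, C, 0.5, 50)
v_constant = truncated_policy_cost(always_a2, [0.25, 0.75], Q, C, 0.5, 50)
```

Since the one-period costs are bounded, the terms dropped by the truncation contribute at most $\alpha^{T} \max_{i,K} |C(i, a_K)| / (1 - \alpha)$ for a horizon of $T$ terms.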

The ($\widetilde{\mathrm{MDP}}$) as defined above (having uncountable state space and
finite action space) belongs to the class of problems analyzed by
Blackwell [1]. His analysis showed that an optimal stationary policy
always exists and that the Howard Policy Improvement Routine may be
extended to this problem. However, in the finite state-finite action problems
the set of all stationary policies is finite and, therefore, the
Howard Policy Improvement Routine will produce an optimal stationary
policy in a finite number of steps. In the uncountable state-finite
action problem, the set of all stationary policies is uncountable and,
therefore, the Howard Policy Improvement Routine cannot, in general,
be used as a method for actually finding an optimal policy.

CONVEX-STATIONARY POLICIES

With $F$ as the set of all stationary policies for ($\widetilde{\mathrm{MDP}}$), we
define the set $\Gamma \subset F$ of all convex-stationary policies for ($\widetilde{\mathrm{MDP}}$) as

$\Gamma = \{f \in F : f^{-1}(a_K) \text{ is a convex set for each } a_K \in A\}$.

CONSTANT SEQUENCE POLICIES

Given the action space $A = \{a_1, a_2, \ldots, a_M\}$, we define $A^N = A \times A \times \cdots \times A$,
N factors, $N = 1, 2, 3, \ldots$, to be the set of all sequences of length $N$ of
elements of $A$, and we define $A^\infty = A \times A \times A \times \cdots$ to be the set of all infinite
sequences of elements of $A$. For any finite sequence $S \in A^N$, $N \ge 1$, we define
the sequence $S^K \in A^{NK}$ to be the sequence $S^K = S, S, \ldots, S$, K factors, and
$S^\infty \in A^\infty$ to be the sequence $S^\infty = S, S, S, \ldots$. For any finite sequences $S_1$
and $S_2$ we define $A(S_1; S_2) = \{S_1, S_2\}$ to be the set consisting of the
two action sequences $S_1$ and $S_2$, and
$A(S_1; S_2)^\infty = A(S_1; S_2) \times A(S_1; S_2) \times \cdots$ to be the set of all infinite
sequences of elements of $A(S_1; S_2)$. For any finite sequence $S = a_1,
a_2, \ldots, a_N$, $N \ge 1$, $a_i \in A$, we define

$L_S(P) = \sum_{t=0}^{N-1} \alpha^t \langle P_t, C(a_{t+1}) \rangle$, with $P_{t+1} = P_t Q(a_{t+1})$,

for $P \in \tilde{S}$ and $P_0 = P$, to be the cost of starting at $P$ and operating for
$N$ time periods when using the $i$-th entry in $S$, $1 \le i \le N$, as the action to
be taken at the $i$-th observation time. We say that $L_S(\cdot)$ is the cost
of using the finite sequence $S$, for all initial $P \in \tilde{S}$. If $S \in A^\infty$, we
define the constant sequence policy $(S)$ to be the policy which uses
the sequence $S$ when starting at any initial $P \in \tilde{S}$; we define $(A^\infty)$ to
be the set of all such policies, and we use $V_{(S)}(\cdot)$ in place of
$L_{(S)}(\cdot)$ for the cost of the policy $(S)$.
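
The cost of a finite action sequence, and its repetition $S^K$, can be sketched as follows. This is an illustrative reading of the definitions above under assumed example data; `sequence_cost` and the numerical values are hypothetical and do not come from the paper.

```python
def step(P, Q_a):
    """Return P' = P Q(a)."""
    n = len(P)
    return [sum(P[i] * Q_a[i][j] for i in range(n)) for j in range(n)]


def sequence_cost(S, P, Q, C, alpha):
    """Discounted cost of playing the finite action sequence S from P."""
    total, Pt = 0.0, list(P)
    for t, a in enumerate(S):
        total += alpha ** t * sum(p * c for p, c in zip(Pt, C[a]))
        Pt = step(Pt, Q[a])
    return total


# Assumed example data (not from the paper).
Q = {"a1": [[0.9, 0.1], [0.2, 0.8]], "a2": [[0.5, 0.5], [0.5, 0.5]]}
C = {"a1": [1.0, 3.0], "a2": [2.0, 2.0]}

S = ["a1", "a2"]
P = [0.25, 0.75]
cost_S = sequence_cost(S, P, Q, C, 0.5)

# Costs of S^K for growing K; list repetition builds the sequence S, S, ..., S.
costs_SK = [sequence_cost(S * K, P, Q, C, 0.5) for K in (1, 2, 4, 8, 16)]
```

With nonnegative costs the values in `costs_SK` increase with `K` and settle down quickly, which gives a practical way to approximate the cost of the constant sequence policy $(S^\infty)$.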

IMMEDIATE RESULTS

LEMMA 1: The cost function, $V_{(r)}$, for any policy $(r) \in (A^\infty)$ is
linear on $\tilde{S}$.

Proof: Let the sequence $r \in A^\infty$ be given by $r = a_1, a_2, a_3, \ldots$. Now,
for any points $P$, $P'$, $P''$ in $\tilde{S}$ such that $P = \lambda P' + (1-\lambda)P''$ for some
$\lambda \in [0, 1]$ we have (with $Q(a_0)$ = the identity operator),

$V_{(r)}(P) = \sum_{t=0}^{\infty} \alpha^t \langle P\, Q(a_0)\, Q(a_1) \cdots Q(a_t),\ C(a_{t+1}) \rangle$

$= \lambda \sum_{t=0}^{\infty} \alpha^t \langle P'\, Q(a_0) \cdots Q(a_t),\ C(a_{t+1}) \rangle$

$\quad + (1-\lambda) \sum_{t=0}^{\infty} \alpha^t \langle P''\, Q(a_0) \cdots Q(a_t),\ C(a_{t+1}) \rangle$

$= \lambda V_{(r)}(P') + (1-\lambda) V_{(r)}(P'')$,

or

$V_{(r)}(\lambda P' + [1-\lambda]P'') = \lambda V_{(r)}(P') + (1-\lambda) V_{(r)}(P'')$. Q.E.D.
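
The linearity in Lemma 1 can be checked numerically on a truncated sequence. The sketch below is only an illustration with assumed example data; `sequence_cost` and `mix` are hypothetical helpers, and a finite action sequence stands in for the infinite sequence $r$.

```python
def step(P, Q_a):
    """Return P' = P Q(a)."""
    n = len(P)
    return [sum(P[i] * Q_a[i][j] for i in range(n)) for j in range(n)]


def sequence_cost(S, P, Q, C, alpha):
    """Discounted cost of playing the finite action sequence S from P."""
    total, Pt = 0.0, list(P)
    for t, a in enumerate(S):
        total += alpha ** t * sum(p * c for p, c in zip(Pt, C[a]))
        Pt = step(Pt, Q[a])
    return total


def mix(lam, P1, P2):
    """Return the convex combination lam * P1 + (1 - lam) * P2."""
    return [lam * x + (1 - lam) * y for x, y in zip(P1, P2)]


# Assumed example data (not from the paper).
Q = {"a1": [[0.9, 0.1], [0.2, 0.8]], "a2": [[0.5, 0.5], [0.5, 0.5]]}
C = {"a1": [1.0, 3.0], "a2": [2.0, 2.0]}

S = ["a1", "a2", "a2", "a1", "a1", "a2"] * 5   # truncated stand-in for r
lam, P1, P2 = 0.3, [0.9, 0.1], [0.2, 0.8]

lhs = sequence_cost(S, mix(lam, P1, P2), Q, C, 0.9)
rhs = lam * sequence_cost(S, P1, Q, C, 0.9) + (1 - lam) * sequence_cost(S, P2, Q, C, 0.9)
```

The two sides agree up to floating-point rounding because, once the action sequence is fixed, every term of the sum is a linear function of the starting distribution.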


For any stationary policy $f$ and any $P \in \tilde{S}$, we define the
sequence $S(P, f) \in A^\infty$,

$S(P, f) = \{a_t\}$, by $a_{t+1} = f(P_t)$,

$P_{t+1} = P_t Q(f[P_t])$, for $t = 0, 1, 2, \ldots$ and $P_0 = P$.

We say that $S(P, f)$ is the sequence generated at $P$ when the stationary
policy $f$ is used.

LEMMA 2: For any stationary policy $f$ and any $P \in \tilde{S}$, we have
$V_f(P) = V_{(S[P,f])}(P)$, where $V_f(P)$ is the cost of using
the stationary policy $f$ and starting at $P$.

Proof: By definition of $S(P, f)$. Q.E.D.

THEOREM 1: The optimal cost function is concave on $\tilde{S}$.

Proof: Let $f^*$ be optimal. Lemmas 1 and 2 show that at each $P \in \tilde{S}$

$V_{f^*}(P) = V_{(S[P,f^*])}(P) = \inf_{(S) \in (A^\infty)} V_{(S)}(P)$.

Therefore, we see that at each point $P \in \tilde{S}$, $V_{f^*}(P)$ is the infimum over
a set of linear functions and hence $V_{f^*}$ is concave. Q.E.D.
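
The argument of Theorem 1 can be illustrated on a finite horizon, where the infimum over action sequences becomes a minimum over finitely many linear functions. The sketch below is an assumption-laden illustration, not the paper's procedure: it enumerates all sequences of a short fixed length by brute force, using made-up example data and hypothetical helper names.

```python
from itertools import product


def step(P, Q_a):
    """Return P' = P Q(a)."""
    n = len(P)
    return [sum(P[i] * Q_a[i][j] for i in range(n)) for j in range(n)]


def sequence_cost(S, P, Q, C, alpha):
    """Discounted cost of playing the finite action sequence S from P."""
    total, Pt = 0.0, list(P)
    for t, a in enumerate(S):
        total += alpha ** t * sum(p * c for p, c in zip(Pt, C[a]))
        Pt = step(Pt, Q[a])
    return total


def best_sequence_cost(P, Q, C, alpha, T):
    """Minimum over all action sequences of length T of their cost at P."""
    return min(sequence_cost(S, P, Q, C, alpha)
               for S in product(sorted(Q), repeat=T))


# Assumed example data (not from the paper).
Q = {"a1": [[0.9, 0.1], [0.2, 0.8]], "a2": [[0.5, 0.5], [0.5, 0.5]]}
C = {"a1": [1.0, 3.0], "a2": [2.0, 2.0]}

# Evaluate the minimum along the segment P = (p, 1 - p), p = 0, 0.1, ..., 1.
values = [best_sequence_cost([k / 10, 1 - k / 10], Q, C, 0.9, 6) for k in range(11)]

# Midpoint concavity on the equally spaced grid.
is_concave = all(values[k] >= 0.5 * (values[k - 1] + values[k + 1]) - 1e-9
                 for k in range(1, 10))
```

The enumeration grows as $M^T$, so this check is only practical for very small horizons; it is meant to make the "minimum of linear functions" picture concrete.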

Next we prove that the optimal cost function is continuous on
$\tilde{S}$ by making use of the following representation. Let $B$ be the set
of all bounded Baire functions on $\tilde{S}$. Define a norm on $B$ by

$\|V\| = \sup_{P \in \tilde{S}} |V(P)|$, for any $V \in B$.

Next, define the operator $U : B \to B$ by

$(UV)(P) = \min_{a_K \in A} \{L_{a_K}(P) + \alpha V(PQ[a_K])\}$.

In Lemma 3, we state some results presented in Reference [1].


LEMMA 3:

(i) $U$ is a contraction operator.

(ii) For any $V \in B$, the sequence $u_n = U^n V$ converges to the
optimal cost function $V_{f^*}$.

(iii) The optimal cost function, $V_{f^*}$, is the unique solution
to $UV_{f^*} = V_{f^*}$.
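
For a two-state problem the operator $U$ can be approximated on a grid, since a distribution is then described by a single number $p = P_1$. The sketch below is a hedged illustration only: the grid, the linear interpolation, the example data, and the names `interp` and `apply_U` are all assumptions introduced here, and the discretized operator is a stand-in for $U$, not the operator itself.

```python
def interp(V, p):
    """Linearly interpolate grid values V (on an even grid of [0, 1]) at p."""
    m = len(V) - 1
    x = min(max(p, 0.0), 1.0) * m
    k = min(int(x), m - 1)
    w = x - k
    return (1 - w) * V[k] + w * V[k + 1]


def apply_U(V, Q, C, alpha):
    """Grid version of (UV)(P) = min_a { <P, C(a)> + alpha * V(P Q[a]) }."""
    m = len(V) - 1
    out = []
    for k in range(m + 1):
        p = k / m
        P = (p, 1.0 - p)
        out.append(min(
            P[0] * C[a][0] + P[1] * C[a][1]
            + alpha * interp(V, P[0] * Q[a][0][0] + P[1] * Q[a][1][0])
            for a in Q
        ))
    return out


# Assumed example data (not from the paper).
Q = {"a1": [[0.9, 0.1], [0.2, 0.8]], "a2": [[0.5, 0.5], [0.5, 0.5]]}
C = {"a1": [1.0, 3.0], "a2": [2.0, 2.0]}
alpha = 0.9

V = [0.0] * 101          # start from the zero function on a 101-point grid
diffs = []               # sup-norm distance between successive iterates
for _ in range(200):
    V_next = apply_U(V, Q, C, alpha)
    diffs.append(max(abs(x - y) for x, y in zip(V_next, V)))
    V = V_next
```

Linear interpolation and the minimum are both nonexpansive in the sup norm, so the grid operator keeps the contraction modulus $\alpha$; the successive distances in `diffs` therefore shrink at least geometrically, mirroring parts (i) and (ii) of Lemma 3.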


We  now  have  the  following  Theorem. 


THEOREM 2: The optimal cost function is uniformly continuous
on $\tilde{S}$.

Proof: For any $u \in B$, we have

$u_n = U^n u \to V_{f^*}$ as $n \to \infty$.

We also have

$u_{n+1}(P) = \min_{a_K \in A} \{L_{a_K}(P) + \alpha\, u_n(PQ[a_K])\}$

for $P \in \tilde{S}$. Therefore, we see that since $L_{a_K}(\cdot)$ is continuous for each
$a_K \in A$, each $u_n$ will be continuous if $u_0 = u$ is continuous. The
convergence of $u_n \to V_{f^*}$ is uniform because $U$ is a contraction
operator, i.e.

$\|u_n - V_{f^*}\| \le \alpha^n \|u - V_{f^*}\|$.

Hence $V_{f^*}$ is a uniform limit of continuous functions on the compact
set $\tilde{S}$ and is therefore uniformly continuous. Q.E.D.